Interleaved cache controllers with shared metadata and related devices and systems

ABSTRACT

Interleaved cache controllers with shared metadata are disclosed and described. A memory system may comprise a plurality of cache controllers and a metadata store interconnected by a metadata store fabric. The metadata store receives information from at least one of the plurality of cache controllers, a portion of which is stored as shared distributed metadata. The metadata store provides the plurality of cache controllers with shared access to the hosted shared distributed metadata.

BACKGROUND

Computer and electronic devices have become integral to the lives of many and include a wide range of uses from social media activity to intensive computational data analysis. Such devices can include smartphones, tablets, laptops, desktop computers, network servers, and the like. Memory systems and subsystems play an important role in the implementation of such devices, and are one of the key factors affecting performance. Accordingly, memory systems and subsystems are the subject of continual research and development.

BRIEF DESCRIPTION OF THE DRAWINGS

Features and advantages of the embodiments will be apparent from the detailed description which follows, taken in conjunction with the accompanying drawings, which together illustrate, by way of example, embodiment features; and, wherein:

FIG. 1 is a schematic view of an exemplary memory system;

FIG. 2 is a schematic view of an exemplary memory system;

FIG. 3 is a schematic view of an exemplary memory system;

FIG. 4 is a schematic view of an exemplary memory system;

FIG. 5 is a schematic view of an exemplary memory system;

FIG. 6 is a schematic view of an exemplary memory system;

FIG. 7A is a schematic view of an exemplary memory system;

FIG. 7B is a schematic view of an exemplary memory system;

FIG. 7C is a schematic view of an exemplary memory system;

FIG. 8A is a representation of an exemplary metadata entry;

FIG. 8B is a representation of an exemplary shared metadata entry;

FIG. 9 is a schematic view of an exemplary system; and

FIG. 10 is a representation of steps of an exemplary method of a memory system with shared metadata.

Reference will now be made to the exemplary embodiments illustrated, and specific language will be used herein to describe the same. It will nevertheless be understood that no limitation on invention scope is thereby intended.

DESCRIPTION OF EMBODIMENTS

Although the following detailed description contains many specifics for the purpose of illustration, a person of ordinary skill in the art will appreciate that many variations and alterations to the following details can be made and are considered included herein.

Accordingly, the following embodiments are set forth without any loss of generality to, and without imposing limitations upon, any claims set forth. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. The same reference numerals in different drawings represent the same element. Numbers provided in flow charts and processes are provided for clarity in illustrating steps and operations and do not necessarily indicate a particular order or sequence. Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

As used in this written description, the singular forms “a,” “an” and “the” include support for plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a bit line” includes support for a plurality of such bit lines.

In this application, “comprises,” “comprising,” “containing” and “having” and the like can have the meaning ascribed to them in U.S. Patent law and can mean “includes,” “including,” and the like, and are generally interpreted to be open ended terms. The terms “consisting of” or “consists of” are closed terms, and include only the components, structures, steps, or the like specifically listed in conjunction with such terms, as well as that which is in accordance with U.S. Patent law. “Consisting essentially of” or “consists essentially of” have the meaning generally ascribed to them by U.S. Patent law. In particular, such terms are generally closed terms, with the exception of allowing inclusion of additional items, materials, components, steps, or elements, that do not materially affect the basic and novel characteristics or function of the item(s) used in connection therewith. For example, trace elements present in a composition, but not affecting the composition's nature or characteristics, would be permissible if present under the “consisting essentially of” language, even though not expressly recited in a list of items following such terminology. When using an open ended term in this written description, like “comprising” or “including,” it is understood that direct support should be afforded also to “consisting essentially of” language as well as “consisting of” language as if stated explicitly, and vice versa.

The terms “first,” “second,” “third,” “fourth,” and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Similarly, if a method is described herein as comprising a series of steps, the order of such steps as presented herein is not necessarily the only order in which such steps may be performed, and certain of the stated steps may possibly be omitted and/or certain other steps not described herein may possibly be added to the method.

The terms “left,” “right,” “front,” “back,” “top,” “bottom,” “over,” “under,” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

As used herein, “enhanced,” “improved,” “performance-enhanced,” “upgraded,” and the like, when used in connection with the description of a device or process, refers to a characteristic of the device or process that provides measurably better form or function as compared to previously known devices or processes. This applies both to the form and function of individual components in a device or process, as well as to such devices or processes as a whole.

As used herein, “coupled” refers to a relationship of electrical or physical connection or attachment between one item and another item, and includes relationships of either direct or indirect connection or attachment. Any number of items can be coupled, such as materials, components, structures, layers, devices, objects, etc.

As used herein, “directly coupled” refers to a relationship of electrical or physical connection or attachment between one item and another item where the items have at least one point of direct physical contact or otherwise touch one another. For example, when one layer of material is deposited on or against another layer of material, the layers can be said to be directly coupled.

Objects or structures described herein as being “adjacent to” each other may be in physical contact with each other, in close proximity to each other, or in the same general region or area as each other, as appropriate for the context in which the phrase is used.

As used herein, the term “substantially” refers to the complete or nearly complete extent or degree of an action, characteristic, property, state, structure, item, or result. For example, an object that is “substantially” enclosed would mean that the object is either completely enclosed or nearly completely enclosed. The exact allowable degree of deviation from absolute completeness may in some cases depend on the specific context. However, generally speaking, the nearness of completion will be so as to have the same overall result as if absolute and total completion were obtained. The use of “substantially” is equally applicable when used in a negative connotation to refer to the complete or near complete lack of an action, characteristic, property, state, structure, item, or result. For example, a composition that is “substantially free of” particles would either completely lack particles, or so nearly completely lack particles that the effect would be the same as if it completely lacked particles. In other words, a composition that is “substantially free of” an ingredient or element may still actually contain such item as long as there is no measurable effect thereof.

As used herein, the term “about” is used to provide flexibility to a numerical range endpoint by providing that a given value may be “a little above” or “a little below” the endpoint. However, it is to be understood that even when the term “about” is used in the present specification in connection with a specific numerical value, support for the exact numerical value recited apart from the “about” terminology is also provided.

As used herein, a plurality of items, structural elements, compositional elements, and/or materials may be presented in a common list for convenience. However, these lists should be construed as though each member of the list is individually identified as a separate and unique member. Thus, no individual member of such list should be construed as a de facto equivalent of any other member of the same list solely based on their presentation in a common group without indications to the contrary.

Concentrations, amounts, and other numerical data may be expressed or presented herein in a range format. It is to be understood that such a range format is used merely for convenience and brevity and thus should be interpreted flexibly to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. As an illustration, a numerical range of “about 1 to about 5” should be interpreted to include not only the explicitly recited values of about 1 to about 5, but also include individual values and sub-ranges within the indicated range. Thus, included in this numerical range are individual values such as 2, 3, and 4 and sub-ranges such as from 1-3, from 2-4, and from 3-5, etc., as well as 1, 1.5, 2, 2.3, 3, 3.8, 4, 4.6, 5, and 5.1 individually.

This same principle applies to ranges reciting only one numerical value as a minimum or a maximum. Furthermore, such an interpretation should apply regardless of the breadth of the range or the characteristics being described.

Reference throughout this specification to “an example” means that a particular feature, structure, or characteristic described in connection with the example is included in at least one embodiment. Thus, appearances of the phrase “in an example” in various places throughout this specification are not necessarily all referring to the same embodiment.

Example Embodiments

An initial overview of the embodiments is provided below, and specific embodiments are then described in further detail. This initial summary is intended to aid readers in understanding the disclosure more quickly, but is not intended to identify key or essential technological features, nor is it intended to limit the scope of the claimed subject matter.

In computing, interleaved memory is a design that compensates for the relatively slow speed of dynamic random-access memory (DRAM) by spreading memory addresses evenly across memory channels. In this way, contiguous memory read and write operations use each memory channel in turn, resulting in higher memory throughput. This is achieved by allowing memory channels to perform the desired operations in parallel, yet not forcing individual non-contiguous memory transactions into issuing the excessively large transactions that would result if the data bus to memory were merely widened. Memory systems, including one-level memory (1LM) systems that implement high bandwidth using multiple memory controllers, such as DRAM controllers, can interleave memory transactions between the controllers.
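
By way of illustration only, the following C sketch shows one way such interleaving can be expressed: consecutive fixed-size granules of the address space are assigned to channels round-robin. The channel count and granule size are assumed values for the sketch, not limitations of any embodiment.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CHANNELS  2u    /* assumed channel count */
#define GRANULE_BYTES 128u  /* assumed interleave granularity */

/* Map a physical address to a memory channel by spreading
   consecutive 128-byte granules across channels round-robin. */
static unsigned channel_for_address(uint64_t addr)
{
    return (unsigned)((addr / GRANULE_BYTES) % NUM_CHANNELS);
}

int main(void)
{
    /* Consecutive granules alternate between channels, so a
       contiguous stream keeps every channel busy in turn. */
    for (uint64_t addr = 0; addr < 512; addr += GRANULE_BYTES)
        printf("addr %4llu -> channel %u\n",
               (unsigned long long)addr, channel_for_address(addr));
    return 0;
}
```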

An operating system (OS) allocates memory in chunks. For example, a program executing on the OS may request an allocation of memory for its data, and the OS will provide this allocation as a non-sequential series of chunks of a specified size. The use of fixed-size chunks when allocating memory allows large allocations of memory to be made even where, as a result of continuous software operations, memory has become highly fragmented. In one embodiment, a typical OS will allocate memory in 4 KByte (4096-byte) chunks.

A system may implement a plurality of memory controllers to increase efficiency. However, it is not desirable that the interleave granularity between memory controllers be 4K, as this may result in a read of an entire 4K chunk being serviced by only a single memory controller and a single memory channel. Therefore, requests can be interleaved at a size smaller than the size allocated by the OS. For example, requests for 256 bytes of data interleaved between controllers at 128-byte granularity can be serviced by more than one memory controller in parallel. Similarly, a request to read an entire 4 Kbyte OS page could be serviced by multiple controllers in parallel.
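
By way of illustration only, the following C sketch counts how many distinct controllers would service a given request under two assumed interleave granularities, showing why a 128-byte interleave lets a 256-byte request be serviced by two controllers in parallel while a 4 KByte interleave confines it to one. The constants are assumptions of the sketch.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CTRL 2u  /* assumed number of memory controllers */

/* Count how many distinct controllers service [addr, addr+len)
   for a given interleave granularity. */
static unsigned controllers_touched(uint64_t addr, uint64_t len,
                                    uint64_t granule)
{
    unsigned seen[NUM_CTRL] = {0}, count = 0;
    for (uint64_t g = addr / granule; g <= (addr + len - 1) / granule; g++) {
        unsigned c = (unsigned)(g % NUM_CTRL);
        if (!seen[c]) { seen[c] = 1; count++; }
    }
    return count;
}

int main(void)
{
    /* 256-byte read at 4 KByte interleave: one controller only. */
    printf("4K interleave:   %u controller(s)\n",
           controllers_touched(512, 256, 4096));
    /* Same read at 128-byte interleave: both controllers in parallel. */
    printf("128B interleave: %u controller(s)\n",
           controllers_touched(512, 256, 128));
    return 0;
}
```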

A memory system with two cache controllers connected to two memory controllers may maintain tags within each cache controller for half-OS-pages rather than OS pages, doubling the size and cost of the large tag arrays. A different memory system may limit the interleave between cache controllers to OS page size, causing a 50% loss in stream bandwidth. Yet another memory system may, in addition to limiting the interleave between cache controllers to OS page size, add a memory fabric between the cache controllers and the memory controllers, causing a multi-cycle latency penalty.

One or more cache controllers may be implemented in memory systems to control local storage of cached data. In adapting such a system to include a memory-side cache, such as in a two-level memory (2LM) system, bandwidth requirements typically necessitate the use of multiple cache controllers. The memory may store all the data but may be slow, and therefore a portion of the data stored in the memory will be stored locally in the cache and managed by the cache controllers. In one embodiment, the cache controllers are capable of holding entries that relate to 4 Kbyte memory allocations, in line with the allocation granularity of an OS. The cache controllers may store data locally and hold the metadata on-die in a static random-access memory (SRAM) array to allow quick identification of the data stored locally. The metadata stored by the cache controllers will typically include cache tags. Each cache controller has an upper limit on how many cache tags or pieces of metadata may be stored. Various embodiments provide a metadata store fabric that provides a plurality of cache controllers with shared access to a plurality of metadata stores. A metadata store fabric may be hardware comprising a set of connections between the metadata stores and the cache controllers that allows an exchange of data between them.

From a metadata storage perspective, efficient implementation of a design with multiple cache controllers requires interleaving between the cache controllers at OS-page granularity or greater. In one embodiment, reconciling this with the desire to interleave memory controllers at sub-OS-page granularity may involve trade-offs in performance. Embodiments exemplified herein include memory devices, systems, and methods that re-distribute storage and handling of memory-side cache metadata utilizing a mesh structure between multiple cache controllers and multiple metadata stores. The mesh structure may be a hardware structure and may also be referred to as a “metadata store fabric” or simply a “fabric”. The metadata stores may store the metadata or cache tags as shared distributed metadata. The shared distributed metadata allows a first cache controller to send information such as cache tags or metadata to a metadata store connected through the metadata store fabric. The metadata store then converts or stores the cache tag as shared distributed metadata and provides shared access to the shared distributed metadata, allowing a second cache controller to access the shared distributed metadata that is based on the information from the first cache controller. This allows the second cache controller to carry out an operation based on cache tags or metadata without the need to allocate an additional metadata entry. Thus the second cache controller, or all of the cache controllers in the memory system, may be able to operate more efficiently at a higher bandwidth without increasing the capacity or size of the local store of the cache controller. For example, 256-byte requests may be handled by two cache controllers in parallel and by two memory controllers in parallel. In one embodiment, the present disclosure utilizes tag and valid bits. The tags and valid bits are part of the metadata or shared distributed metadata that allow operations on the memory to occur. The shared distributed metadata also introduces lock bits that lock the shared distributed metadata until the lock bit is cleared by the associated cache controller. This ensures that the shared distributed metadata is not cleared from a metadata store while it is still needed for operations and possible update by a given cache controller. The mesh structure allows for efficient operation with OS-page-granularity cache entries, and hence metadata entries, in terms of metadata usage. The mesh also allows for efficient memory interleaving between cache controllers at sub-OS-page-size granularity in terms of an optimized data path.
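
By way of illustration only, the following C sketch separates the two mappings described above: the metadata store that owns an entry may be selected per OS page, while the cache controller servicing a data slice is selected at sub-page granularity. The constants, function names, and example address are hypothetical, not taken from the disclosure.

```c
#include <stdint.h>
#include <stdio.h>

#define OS_PAGE_BYTES 4096u
#define NUM_STORES    2u    /* assumed number of metadata stores */
#define NUM_CTRL      2u    /* assumed number of cache controllers */

/* Any cache controller can reach any metadata store through the
   fabric; the store owning an entry is chosen per OS page. */
static unsigned store_for_page(uint64_t addr)
{
    return (unsigned)((addr / OS_PAGE_BYTES) % NUM_STORES);
}

/* Data interleaving stays at sub-page granularity, so the controller
   handling a 128-byte slice is chosen independently of the store. */
static unsigned ctrl_for_slice(uint64_t addr)
{
    return (unsigned)((addr / 128u) % NUM_CTRL);
}

int main(void)
{
    uint64_t addr = 0x3080; /* hypothetical address inside OS page 3 */
    printf("metadata store %u, cache controller %u\n",
           store_for_page(addr), ctrl_for_slice(addr));
    return 0;
}
```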

The use of metadata stores, a metadata store fabric, and shared distributed metadata allows the data to flow through a cache controller without requiring the cache controller to locally store all metadata, because the metadata is stored in the metadata store. In one embodiment, the present disclosure may be combined with various techniques to achieve zero additional latency for all cache hit transactions, even when sub-page interleaving is used.

FIG. 1 shows a system-on-chip (SOC) 102 with a basic 1LM system. The SOC 102 includes a central processing unit (CPU) 104 for processing data in a computer system. It should be appreciated that the CPU 104 may comprise integrated caches, not pictured, which are integrated into subsystems of the CPU 104. The SOC 102 also comprises an integrated display controller 106, a controller to control output data being displayed to a user on a display such as a screen. The SOC 102 additionally comprises an IO subsystem 108, which is an input output system for inputting and outputting data for the SOC 102. The SOC 102 also comprises a system fabric 110, which can be a hardware fabric for connecting a memory controller 112 and other components of the SOC 102 together. The memory controller 112 is dedicated hardware incorporated into the SOC 102 for controlling the memory 114. In one embodiment, the memory 114 is DRAM, but it should be appreciated that the memory 114 may be other types of memory as well. In one embodiment, FIG. 1 shows a 1LM system where the operating system employs a 4 KByte page and the memory 114 has a 4 KByte page size. In one example, two adjacent OS-allocated pages of data, “A” and “B,” are shown stored in the memory 114. While the illustration of FIG. 1 shows a system-on-chip (SOC) 102, it may equally apply to a computer system built with more discrete components, for example where the display controller 106 and the IO subsystem 108 are outside the boundary of element 102, and where element 102 represents a CPU with an integrated system fabric 110 and memory controller 112.

FIG. 2 shows a 1LM system with a SOC 200 that has multiple memory controllers. The SOC 200 may include some of the components of the SOC 102. In one embodiment, the SOC 200 includes two memory controllers, specifically the memory controller 204 and the memory controller 206, that are connected to the system fabric 110 via a memory fabric 202. The memory fabric 202 is hardware configured to interleave across the two memory controllers as well as the memory 208 and the memory 210. For example, the interleave may occur every 4K bytes. In one configuration, when the system is reading from page A, only the memory controller 204 and the memory 208 are servicing the requests, and likewise, when it is reading from page B, only the memory controller 206 and the memory 210 are servicing the requests. Thus, although the memory controller and memory bandwidth has theoretically been doubled, the peak stream bandwidth of the system of FIG. 2 will remain little changed when compared to the system of FIG. 1.

FIG. 3 shows a 1LM system with a SOC 300 that has multiple memory controllers. The SOC 300 may comprise some of the components of the SOCs 102 and/or 200 and illustrates how the memory fabric 202 of the SOC 200 may be configured differently in FIG. 3. In one embodiment, the SOCs 102, 200, and 300 depict examples where a SOC may issue multiple read requests simultaneously. However, the SOC 300 depicts embodiments that improve or optimize the performance ‘stream bandwidth’ where such multiple read requests exist. For example, the system may request to read 256 bytes, which may be one sixteenth of a memory page such as a DRAM page. In embodiments of the system 300, each OS page has been sliced such that A becomes A₀ and A₁. For example, A₀ contains data for bytes 0-127, 256-383, 512-639, 768-895, 1024-1151, 1280-1407, 1536-1663, 1792-1919, 2048-2175, 2304-2431, 2560-2687, 2816-2943, 3072-3199, 3328-3455, 3584-3711, and 3840-3967, and A₁ contains data for bytes 128-255, 384-511, 640-767, 896-1023, 1152-1279, 1408-1535, 1664-1791, 1920-2047, 2176-2303, 2432-2559, 2688-2815, 2944-3071, 3200-3327, 3456-3583, 3712-3839, and 3968-4095 within the page. Thus, a request to read 256 sequential bytes, such as from address 512 to address 767, will be serviced by both the memory controller 204 and the memory 302 (bytes 512-639) and the memory controller 206 and the memory 304 (bytes 640-767), realizing a doubling of bandwidth compared to the SOC 102 of FIG. 1.
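
By way of illustration only, the byte ranges listed above follow from taking even-indexed 128-byte lines of a page for A₀ and odd-indexed lines for A₁. The following C sketch computes the slice for a byte offset under that assumption.

```c
#include <stdint.h>
#include <stdio.h>

#define SLICE_BYTES 128u

/* A byte offset within a 4 KByte OS page belongs to slice A0 when its
   128-byte line index is even, and to slice A1 when the index is odd,
   reproducing the byte ranges listed above. */
static unsigned slice_for_offset(uint32_t offset)
{
    return (offset / SLICE_BYTES) & 1u;
}

int main(void)
{
    /* Bytes 512-639 land in A0 and bytes 640-767 in A1, so a read of
       addresses 512-767 is serviced by both memories in parallel. */
    printf("offset 512 -> A%u\n", slice_for_offset(512));
    printf("offset 640 -> A%u\n", slice_for_offset(640));
    return 0;
}
```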

FIG. 4 shows a 2LM system with a SOC 400. The SOC 400 may include some of the components of the SOCs 102, 200, and/or 300. The SOC 400 depicts embodiments which further include a cache controller 408 and a cache controller 410 disposed between the system fabric 110 and the memory controller 204 and the memory controller 206, respectively. FIG. 4 depicts a memory 402 further comprising a memory 404 and a SOC memory controller 406. The memory controller 406 is connected to the cache controller 408 and the cache controller 410. FIG. 4 also depicts a memory 418 comprising the memory 414 and the memory 416 connected to the SOC memory controller 204 and the SOC memory controller 206, respectively. The memories 414 and 416 provide relatively fast data storage for the cache controller 408 and the cache controller 410, thereby allowing fast access to cached data of the memory 404. The storage of pages A and B in the memories 414 and 416 may be similar to what is described in the system of FIG. 3. However, the position of the pages within each memory may be influenced by the organizational policies, such as the use of ways 0, 1, 2, 3, and 4, of the cache controller 408 and the cache controller 410.

In the SOC 400, separate arrays of cache tags (not shown) either exist in each of the cache controller 408 and the cache controller 410, or are stored separately from the cache controllers 408 and 410 but are accessible to them. The cache tags are references indicating which portions of the main memory 404 are held in which pages of the cache, and are maintained by each cache controller. Thus, for a single OS page “in use,” such as A, there is a double overhead of assigning, storing, looking up, and maintaining tags, where the cache controller 408 maintains the tag for A₀ and the cache controller 410 maintains the tag for A₁. One design approach to avoid this double overhead is to use a single cache controller. However, in many cases, due to bus throughput or other scaling issues, memory controller location, or the integration of the memory controller/cache controller fabric into the system fabric, this approach of only a single cache controller is impractical. Thus, in a practical system, multiple cache controllers are matched to multiple memory controllers. Another approach to solving the double tag problem of the system shown in FIG. 4 is the system of FIG. 5.

FIG. 5 shows a 2LM system with a SOC 500. The SOC 500 may comprise some of the components of the SOCs 102, 200, 300, and/or 400. The SOC 500 depicts a larger interleave between the two cache controllers (for example, 4 KByte) as compared to the SOC 400 of FIG. 4, such that an entire OS page is handled by a single cache controller. However, this large interleave causes bandwidth limitations similar to the SOC 200 of FIG. 2, as only one memory controller handles each OS page.

FIG. 6 shows a 2LM system with a SOC 600. The SOC 600 may comprise some of the components of the SOCs 102, 200, 300, 400, and/or 500. The SOC 600 connects to the memory 604 and the memory 608. The SOC 600 depicts an additional fabric, a memory fabric 602, which is disposed between the cache controllers 408 and 410 and the memory controllers 204 and 206. The memory fabric 602 provides interleaving at the memory with sub-OS-page granularity (for example, 128 bytes or other values), while still allowing the cache controllers to be interleaved by the memory fabric 202 at OS-page granularity (for example, 4 KByte). However, there is the added latency impact of the memory fabric 602, and the desire that individual cache controllers each be capable of handling the full bandwidth of both memory controllers.

FIG. 7A shows a 2LM system with a SOC 700 in accordance with various embodiments. The SOC 700 may comprise some of the components of the SOCs 102, 200, 300, 400, 500, and/or 600. The SOC 700 depicts embodiments of the present disclosure that may overcome at least some of the described limitations of the SOCs 400, 500, and/or 600. The SOC 700 depicts a metadata store fabric 702, a metadata store 704, and a metadata store 706. FIG. 7A further depicts the SOC 700 connected to memories 708 and 710. In one embodiment, the metadata stores 704 and 706 are on-die metadata storage blocks that service the cache controllers, but are separated from the cache controllers and each serve a multiplicity of cache controllers. In one embodiment, the metadata store is a static random-access memory (SRAM) array. In other words, metadata storage is extracted out or away from the cache controllers with the implementation of the metadata stores. Each metadata store may serve a multiplicity of the cache controllers in the SOC 700. In one embodiment, the metadata stores 704 and 706 are assembled as separate metadata storages but can be implemented in the same or different memory devices. It should be appreciated that the SOC 700 depicts two cache controllers and two metadata stores, but any number or combination of cache controllers and metadata stores can be used. In a given SOC, for example, the number of cache controllers may be greater than the number of metadata stores, the number of metadata stores may be greater than the number of cache controllers, or the system may include only one metadata store for a plurality of cache controllers.

In one embodiment, within each metadata store, a logic block is added that is assigned responsibility for some of the tasks that would generally be assigned to a cache controller. For example, these tasks may include maintaining least recently used (LRU) indications, and re-allocating the clean entry with the highest LRU when a cache allocation to a new system memory address is required. Various embodiments may achieve the same interleave as shown in the SOC 600 of FIG. 6, but without the latency and wide data paths of the memory fabric. The additional latency of the metadata store fabric may be mitigated by the use of various techniques. For example, identically-offset fragments of the pages stored in multiple ways of a cache set are stored together in a single DRAM page, facilitating the ability of the memory controller to issue DRAM page open requests on the assumption that the requested data will be found in the cache, but prior to knowing in which way it is to be found.
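
By way of illustration only, the following C sketch shows one way such a logic block might select a victim for re-allocation: the clean, unlocked way with the highest LRU value. The structure and field names are hypothetical, not taken from the disclosure.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_WAYS 4u  /* assumed associativity */

/* Hypothetical per-way state kept by a metadata store's logic block. */
struct way_meta {
    uint8_t lru;    /* higher value = less recently used */
    uint8_t dirty;  /* any dirty bit set for the entry   */
    uint8_t locked; /* any controller's lock bit set     */
};

/* Pick the clean, unlocked way with the highest LRU value;
   return -1 if every candidate is dirty or locked, in which case
   scrubbing (or a lock release) must happen first. */
static int pick_victim(const struct way_meta set[NUM_WAYS])
{
    int best = -1;
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (set[w].dirty || set[w].locked)
            continue; /* must be clean and unlocked */
        if (best < 0 || set[w].lru > set[best].lru)
            best = (int)w;
    }
    return best;
}

int main(void)
{
    struct way_meta set[NUM_WAYS] = {
        {3, 1, 0}, /* LRU-most but dirty        */
        {2, 0, 1}, /* clean but locked          */
        {1, 0, 0}, /* clean, unlocked -> victim */
        {0, 0, 0}, /* most recently used        */
    };
    printf("victim way: %d\n", pick_victim(set));
    return 0;
}
```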

FIG. 7B shows a 2LM system with a SOC 701 upon which embodiments of the present disclosure are implemented. The SOC 701 may be described as an alternate configuration of the SOC 700 of FIG. 7A. For example, the metadata stores 704 and 706 in the SOC 701 are each physically co-located with the cache controllers 408 and 410, respectively. Despite the physical proximity of the metadata stores 704 and 706 to the cache controllers 408 and 410, the presence of the metadata store fabric 702 allows the metadata stores 704 and 706 to logically operate in a similar manner to what was described for the SOC 700 of FIG. 7A. For example, FIG. 7B depicts different locations of the metadata stores 704 and 706 relative to their locations in FIG. 7A. However, these different locations of the metadata stores 704 and 706 need not affect the general connectivity of the metadata stores 704 and 706 to the cache controllers 408 and 410 for the described operations. It should be appreciated that the physical proximity of each metadata store to a cache controller may allow simplified construction of derivative designs, for example a ‘chopped’ derivative containing only one cache controller, one metadata store, and one memory controller, or a ‘high-end’ derivative containing four cache controllers, four metadata stores, and four memory controllers.

FIG. 7C shows a 2LM system with a SOC 703 upon which embodiments of the present disclosure are implemented. The SOC 703 is an alternate configuration of the SOC 700 of FIG. 7A or the SOC 701 of FIG. 7B. The system 703 further comprises the common logic block 710. In one embodiment, the common logic block 710 is a logic block connected with the metadata store fabric 702. The common logic block 710 is added to the SOC 703 so that each of the metadata stores is not required to comprise its own logic block that is responsible for tasks. For example, these tasks may include scrubbing, maintaining least recently used (LRU) indications, and re-allocating the clean entry with the highest LRU when a cache allocation to a new system memory address is required. For example, the tasks may be tasks that would otherwise be assigned to the cache controller, but instead are executed by common logic block 710 operations on the metadata stores. In one embodiment, FIGS. 7A-C of the present disclosure depict the development of a “shared metadata entry” that allows cache controllers to each access shared, distributed metadata without the risk of corrupting metadata used by the other cache controllers sharing that metadata entry.

FIG. 8A is a representation of one type of standard metadata entry. For example, a metadata entry 802 is a standard or typical metadata entry and may be used in a set-associative sectored cache. In one embodiment, the metadata entry 802 employs fourteen tag bits for address matching. In such an embodiment, eight valid bits each report on the validity of 512 bytes of data of that entry. Likewise, eight dirty bits indicate whether the data must be scrubbed to main memory before the entry is re-allocated. Three LRU bits track order of use (in relation to other entries of the same cache set), and one pinned “P” bit captures that software has requested that the entry not be re-allocated.
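
By way of illustration only, the entry of FIG. 8A might be represented as the following C bitfield. The packing shown is compiler-dependent and illustrative; it is not part of the disclosure.

```c
#include <stdio.h>

/* A sketch of the standard entry of FIG. 8A (34 bits of state). */
struct metadata_entry {
    unsigned tag    : 14; /* address-match bits                           */
    unsigned valid  : 8;  /* one bit per 512-byte sector of the entry     */
    unsigned dirty  : 8;  /* sector must be scrubbed before re-allocation */
    unsigned lru    : 3;  /* order of use within the cache set            */
    unsigned pinned : 1;  /* software requested no re-allocation          */
};

int main(void)
{
    /* 34 bits of state; actual storage depends on the compiler. */
    printf("entry occupies %zu bytes\n", sizeof(struct metadata_entry));
    return 0;
}
```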

FIG. 8B depicts a shared metadata entry 804, which may be a metadata entry shared among cache controllers as employed by embodiments of the present disclosure. A division of “valid” and “dirty” bits occurs according to each controller. For example, valid[3:0] may refer to bytes 0-127, 256-383, 512-639, and 768-895, all of which may be handled by the cache controller 408 of FIG. 7A. Additionally, valid[7:4] may refer to bytes 128-255, 384-511, 640-767, and 896-1023, all of which may be handled by the cache controller 410 of FIG. 7A. “Lock” bits are included for each cache controller. It should be appreciated that the “lock” bits relate to the valid and dirty bits of a given cache controller. For example, lock 0 (depicted as L[0]) would relate to valid[3:0] and dirty[3:0] for the cache controller 408 of FIG. 7A. Lock 1 (depicted as L[1]) would relate to valid[7:4] and dirty[7:4] for the cache controller 410 of FIG. 7A. For example, an assertion of a “lock” bit indicates that the respective controller has taken a local copy of its “dirty” and “valid” bits for that entry, and that these should not be changed except by that cache controller.
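
By way of illustration only, the shared entry of FIG. 8B might be represented as follows, with a helper that extracts the valid nibble owned by a given controller. Names and packing are illustrative assumptions of the sketch.

```c
#include <stdio.h>

/* A sketch of the shared entry of FIG. 8B: valid/dirty bits are split
   per controller, and one lock bit L[i] is added per controller. */
struct shared_metadata_entry {
    unsigned tag    : 14;
    unsigned valid  : 8;  /* [3:0] owned by controller 0, [7:4] by 1 */
    unsigned dirty  : 8;  /* split the same way as the valid bits    */
    unsigned lru    : 3;
    unsigned pinned : 1;
    unsigned lock   : 2;  /* L[0] and L[1], one per cache controller */
};

/* Nibble of valid bits owned by controller `ctrl` (0 or 1). */
static unsigned valid_for_ctrl(const struct shared_metadata_entry *e,
                               unsigned ctrl)
{
    return (e->valid >> (4u * ctrl)) & 0xFu;
}

int main(void)
{
    struct shared_metadata_entry e = { .tag = 0x2A, .valid = 0x3C };
    /* Controller 0 sees valid[3:0] = 0xC; controller 1 sees 0x3. */
    printf("ctrl0 valid=0x%X ctrl1 valid=0x%X\n",
           valid_for_ctrl(&e, 0), valid_for_ctrl(&e, 1));
    return 0;
}
```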

In one embodiment, the shared metadata entry 804 may be further enhanced by the addition of a lock bit related to the common logic block 710 of FIG. 7C. Such a lock bit is not strictly needed but may be optionally added. An additional lock bit L[2] (not depicted) may be added to the metadata entry 804. The additional lock bit L[2] may cause the metadata store to request that the common logic block 710 complete its task and release this lock prior to the metadata store giving the metadata to the requesting cache controller.

In one embodiment of a system with multiple cache controllers, any entry which is not in use by any controller will have its “lock” bits clear. The metadata store is free to initiate a scrub of the dirty data for that entry and, for clean entries, to re-allocate at will. For example, a re-allocation may occur according to a least recently used (LRU) protocol or other algorithm. When one of the cache controllers receives a transaction to a memory address, it sends a request to the appropriate metadata store to check the appropriate tags for a match (indicating that this memory address has been allocated in the cache); such tags are common to the cache controllers. If a match is found, a copy of the contents of that entry is delivered by the metadata store to the requesting cache controller, and the lock bit pertaining to the requesting cache controller is set in the entry at the metadata store. In such an embodiment, the copy of the contents delivered to the requesting cache controller need not include valid or dirty bits belonging to one of the other controllers. The receiving cache controller serves that transaction as well as any further ones to the other parts of the same OS page that are assigned to it due to the chosen interleave. In one embodiment, the cache controller may update the values of its local copy of the “valid” and “dirty” bits for that entry to reflect the cache operations it has performed. In one embodiment, when the cache controller has completed handling all transactions relating to this entry, it will send an update to the metadata store of the appropriate “valid” and “dirty” bits for that cache controller. In one embodiment, the receipt of this update causes the lock bit for the requesting cache controller to be cleared in the entry at the metadata store.
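
By way of illustration only, the following C sketch models the metadata-store side of this protocol under an assumed nibble-per-controller layout of the valid and dirty fields: a lookup delivers a copy and sets the requester's lock bit, and an update merges back only the requester's own bits before clearing the lock. All names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified entry state held at the metadata store (illustrative). */
struct entry {
    uint16_t tag;
    uint8_t  valid; /* nibble per controller */
    uint8_t  dirty; /* nibble per controller */
    uint8_t  lock;  /* bit per controller    */
};

/* On a tag match, deliver a copy to the requesting controller and set
   its lock bit so the entry is not scrubbed or re-allocated meanwhile. */
static struct entry lookup_and_lock(struct entry *e, unsigned ctrl)
{
    e->lock |= (uint8_t)(1u << ctrl);
    return *e;
}

/* When the controller has completed all transactions for the entry, it
   returns only its own valid/dirty nibbles; merging them and clearing
   the lock cannot disturb the other controller's bits. */
static void update_and_unlock(struct entry *e, unsigned ctrl,
                              unsigned valid_nibble, unsigned dirty_nibble)
{
    unsigned shift = 4u * ctrl;
    uint8_t  mask  = (uint8_t)(0xFu << shift);
    e->valid = (uint8_t)((e->valid & ~mask) | ((valid_nibble & 0xFu) << shift));
    e->dirty = (uint8_t)((e->dirty & ~mask) | ((dirty_nibble & 0xFu) << shift));
    e->lock &= (uint8_t)~(1u << ctrl);
}

int main(void)
{
    struct entry e = { .tag = 0x1A2, .valid = 0x0F, .dirty = 0x03, .lock = 0 };
    struct entry copy = lookup_and_lock(&e, 1); /* controller 1 locks   */
    (void)copy;                                 /* local copy not shown */
    update_and_unlock(&e, 1, 0xF, 0x0);         /* write back, unlock   */
    printf("valid=0x%02X dirty=0x%02X lock=0x%X\n", e.valid, e.dirty, e.lock);
    return 0;
}
```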

The assignment shown for the shared metadata entry 804, regarding which parts of the “valid” and “dirty” fields may be updated by a given cache controller, avoids the problem of stale metadata belonging to one cache controller being written as part of an update by one of the other cache controllers. Such a mechanism allows multiple cache controllers, independently and simultaneously, with no synchronization or communication between them, to access and update a single shared metadata entry without risk of corrupting the “valid” or “dirty” bits relating to data of the entry handled by one of the other cache controllers, because the shared metadata entry is locked.

In one embodiment, once the “lock” bits are clear, the metadata store will again be able to perform scrubbing and re-allocation of entries. As one approach to prevent deadlock cases, the metadata store may also have a mechanism or protocol to instruct a cache controller to send its update in order to release the lock bit.

In reference to tasks and metadata entries in metadata stores, scrubbing is the process of taking a ‘dirty’ cache data entry (i.e., one that contains newer data than the main memory) and making it ‘clean’ (i.e., containing the same data as main memory). Conversely, a ‘clean’ cache data entry may become ‘dirty’ as a result of a write command with new data being received from the CPU. Scrubbing is accomplished by copying the data from the cache to the main memory, which results in the data of both cache and main memory being once again identical; hence this cache data entry can now be considered ‘clean’.

In one embodiment, scrubbing dirty cache data while a lock bit for that entry is set may be possible, provided that the cache controller that set the lock bit indicates to the metadata store whether additional writes were received to that data while the entry was “locked”. For example, this may be because the cache controller has taken a local copy of its “dirty” and “valid” bits for that entry. It is sufficient for a cache controller to notify a metadata store whether additional writes (for example, from the CPU) were received to cache data that was already dirty to allow the metadata store to decide whether an entry that was scrubbed while ‘locked’ may remain clean (if no additional writes were received, and thus the cache data is the same data as main memory), or should be dirty (if additional writes were received and written to the cache data, in which case the cache data is not expected to be the same data as main memory).
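
By way of illustration only, the decision described above reduces to the following C sketch, where the owning controller's report of additional writes determines whether a scrubbed-while-locked entry may remain clean. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* After scrubbing an entry whose lock bit was set, the owning cache
   controller reports whether further writes arrived meanwhile; the
   metadata store then decides the entry's final state. */
static bool clean_after_locked_scrub(bool writes_while_locked)
{
    /* No further writes: cache and main memory now match, so the
       entry may remain clean; otherwise it must be marked dirty. */
    return !writes_while_locked;
}

int main(void)
{
    printf("no writes while locked -> %s\n",
           clean_after_locked_scrub(false) ? "clean" : "dirty");
    printf("writes while locked    -> %s\n",
           clean_after_locked_scrub(true) ? "clean" : "dirty");
    return 0;
}
```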

In one embodiment, when serving transaction requests from an agent that may be expected to access a stream of data, the metadata store may choose to pro-actively send metadata also to cache controller(s) that did not request it, and to set the appropriate lock bit. For example, the stream of data may be a display controller streaming data to the display, as advised to the metadata store by the cache controller.

The non-requesting cache controllers may then match incoming cache access requests against the metadata and know not to send a metadata request to the metadata store because they already have the results for such a metadata request. This will allow those controllers to be prepared should they receive a request to the same OS page as was requested in the initial request.

In one embodiment, logic of the metadata store could request that the cache controllers perform the scrubbing. For example, the logic of the metadata store could send a request to the cache controller to write the cache data for a particular entry to main memory and notify the metadata store when that was done. In another embodiment, the metadata store may read the data cached by the cache controllers from the memory accessed by the memory controllers, either directly or via a request to the cache controllers, and write this to main memory. This may be done directly by the metadata store sending requests to the memory controllers (either directly, or by sending requests to the cache controllers to be forwarded to the memory controllers), receiving data from the memory controllers (either directly, or by the memory controllers sending data to the cache controllers which in turn would send it to the metadata store for reception), and, having received that data from the memory controller, writing it to main memory.

FIG. 9 depicts an exemplary system upon which embodiments of the present disclosure may be implemented. For example, the system of FIG. 9 may be a computer system. The system can include a memory controller 902, a plurality of memory 904, a processor 906, and circuitry 908. The circuitry can be configured to implement the hardware described herein for the systems 700, 701, and/or 703 of FIGS. 7A-C. Various embodiments of such systems for FIG. 9 can include smart phones, laptop computers, handheld and tablet devices, CPU systems, SoC systems, server systems, networking systems, storage systems, high capacity memory systems, or any other computational system.

The system can also include an I/O (input/output) interface 910 for controlling the I/O functions of the system, as well as for I/O connectivity to devices outside of the system. A network interface can also be included for network connectivity, either as a separate interface or as part of the I/O interface 910. The network interface can control network communications both within the system and outside of the system. The network interface can include a wired interface, a wireless interface, a Bluetooth interface, an optical interface, and the like, including appropriate combinations thereof. Furthermore, the system can additionally include various user interfaces, display devices, as well as various other components that would be beneficial for such a system.

The system can also include memory in addition to the memory 904 that can include any device, combination of devices, circuitry, and the like that is capable of storing, accessing, organizing, and/or retrieving data. Non-limiting examples include SANs (Storage Area Networks), cloud storage networks, volatile or non-volatile RAM, phase change memory, optical media, hard-drive type media, and the like, including combinations thereof.

The processor 906 can be a single or multiple processors, and the memory can be a single or multiple memories. The local communication interface can be used as a pathway to facilitate communication between any of a single processor, multiple processors, a single memory, multiple memories, the various interfaces, and the like, in any useful combination.

Although not depicted, any system can include and use a power supply such as, but not limited to, a battery, an AC-DC converter at least to receive alternating current and supply direct current, a renewable energy source (e.g., solar power or motion-based power), or the like.

The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. Portions of the disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).

Reference to storage, stores, memory, or memory devices can refer to memory whose state is indeterminate if power is interrupted to the device (e.g., DRAM) or to memory devices whose state is determinate even if power is interrupted to the device. In one embodiment, such an additional memory device can comprise a block addressable mode memory device, such as planar or multi-dimensional NAND or NOR technologies, or more specifically, multi-threshold level NAND flash memory, NOR flash memory, and the like. A memory device can also include a byte-addressable three-dimensional crosspoint memory device, or other byte-addressable write-in-place nonvolatile memory devices, such as single or multi-level phase change memory (PCM), memory devices that use chalcogenide phase change material (e.g., chalcogenide glass), resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), magnetoresistive random access memory (MRAM) that incorporates memristor technology, or spin transfer torque (STT)-MRAM.

FIG. 10 depicts a flowchart of a method for sharing metadata and metadata stores. The method can be executed as instructions on a machine, where the instructions are included on at least one computer-readable medium or one non-transitory machine-readable storage medium. In one embodiment, the circuitry 908 of FIG. 9 is configured to carry out the steps of FIG. 10. Moreover, the systems depicted in FIGS. 7A-C may be employed to carry out the steps of FIG. 10. The method can include the operation of: connect a metadata store with a plurality of cache controllers via a metadata store fabric, as in block 1002. The method can include the operation of: receive information at the metadata store from at least one of the plurality of cache controllers, as in block 1004. The method can include the operation of: store the information as shared distributed metadata in the metadata store, as in block 1006. The method can include the operation of: provide shared access of the shared distributed metadata to the plurality of cache controllers, as in block 1008. The method can include the operation of: assign a task to a logic block wherein the task executed at the logic block operates on the shared distributed metadata, as in block 1010. The method can include the operation of: lock the valid bits and dirty bits of a given cache controller via a lock bit indicating that the valid bits and dirty bits of the given cache controller should not be changed except by the given cache controller, as in block 1012. The method can include the operation of: upon completion of relevant transactions at a given cache controller, update the appropriate metadata store of appropriate valid bits and dirty bits, which causes a lock bit to be cleared, as in block 1014. It should be appreciated that a given implementation may not include all of the steps depicted in FIG. 10, nor perform them in the order depicted.

EXAMPLES

The following examples pertain to specific embodiments and point out specific features, elements, or steps that can be used or otherwise combined in achieving such embodiments.

In one example, there is provided a memory system, comprising:

a plurality of cache controllers with circuitry configured to:

- access memory controllers which access memory;

a metadata store in communication with the at least one cache controller with circuitry configured to:

- receive information from at least one of the plurality of cache controllers, a portion of which is stored as shared distributed metadata;
- provide shared access of the hosted shared distributed metadata to the plurality of cache controllers; and

a metadata store fabric disposed between the plurality of cache controllers and the at least one metadata store to facilitate the shared access.

In one example of a memory system, the information is related to a task assigned to one of the plurality of cache controllers.

In one example of a memory system, the metadata store fabric further comprises a common logic block to manage the task assigned to one of the plurality of cache controllers.

In one example of a memory system, the metadata store further comprises a logic block to manage the task assigned to one of the plurality of cache controllers.

In one example of a memory system, the metadata store is one of a plurality of metadata stores.

In one example of a memory system, the metadata store is one of a plurality of metadata stores and the number of the plurality of metadata stores corresponds to the number of the plurality of cache controllers.

In one example of a memory system, the metadata store is one of a plurality of metadata stores and the number of the plurality of metadata stores is greater than the number of the plurality of cache controllers.

In one example of a memory system, the metadata store is a static random-access memory (SRAM) array.

In one example of a memory system, one of the tasks assigned to the metadata store comprises maintaining least recently used (LRU) indications.

In one example of a memory system, one of the tasks assigned to the metadata store comprises re-allocating an entry based on the least recently used (LRU) indication when a new system memory address is to be cached.

In one example of a memory system, the shared distributed metadata hosted by the metadata store comprises valid bits and dirty bits.

In one example of a memory system, the shared distributed metadata hosted by the metadata store comprises lock bits pertaining to the plurality of cache controllers.

In one example of a memory system, a lock bit is to assert that the valid bits and dirty bits of a given cache controller are locked and are not changed except by the given cache controller.

In one example of a memory system, one of the plurality of cache controllers, upon completion of all transactions relating to a metadata entry, is to update the metadata store of appropriate valid bits and dirty bits and cause a lock bit to be cleared.

In one example of a memory system, a logic block is configured to identify dirty entries for a scrubbing operation, wherein the logic block is associated with the metadata store fabric or the metadata store.

In one example, there is provided a system, comprising:

one or more processors configured to process data;

an input output subsystem configured to receive input data and to output data;

a plurality of memory controllers to access a plurality of memory;

a plurality of cache controllers with circuitry configured to:

- access memory controllers which access memory;

a cache controller fabric disposed between the system fabric and the plurality of cache controllers;

a metadata store in communication with the plurality of cache controllers with circuitry configured to:

- receive information from at least one of the plurality of cache controllers, a portion of which is stored as shared distributed metadata;
- provide shared access of the hosted shared distributed metadata to the plurality of cache controllers;

a metadata store fabric disposed between the plurality of cache controllers and the metadata store; and

a system fabric configured to connect the one or more processors and the input output subsystem to the plurality of memory controllers and the plurality of cache controllers.

In one example of a system, the information is related to a task assigned to one of the plurality of cache controllers.

In one example of a system, the metadata store fabric further comprises a common logic block to manage the task assigned to one of the plurality of cache controllers.

In one example of a system, the metadata store further comprises a logic block to manage the task assigned to one of the plurality of cache controllers.

In one example of a system, the metadata store is one of a plurality of metadata stores.

In one example of a system, the metadata store is one of a plurality of metadata stores and the number of the plurality of metadata stores corresponds to the number of the plurality of cache controllers.

In one example of a system, the metadata store is one of a plurality of metadata stores and the number of the plurality of metadata stores is greater than the number of the plurality of cache controllers.

In one example of a system, the metadata store is a static random-access memory (SRAM) array.

In one example of a system, one of the tasks assigned to the metadata store comprises maintaining least recently used (LRU) indications.

In one example of a system, one of the tasks assigned to the metadata store comprises re-allocating an entry based on the least recently used (LRU) indication when a new system memory address is to be cached.

In one example of a system, the shared distributed metadata hosted by the metadata store comprises valid bits and dirty bits.

In one example of a system, the shared distributed metadata hosted by the metadata store comprises lock bits pertaining to the plurality of cache controllers.

In one example of a system, a lock bit is to assert that the valid bits and dirty bits of a given cache controller are locked and are not changed except by the given cache controller.

In one example of a system, one of the plurality of cache controllers, upon completion of all transactions relating to a metadata entry, is to update the metadata store of appropriate valid bits and dirty bits and cause a lock bit to be cleared.

In one example of a system, a logic block is configured to identify dirty entries for a scrubbing operation, wherein the logic block is associated with the metadata store fabric or the metadata store.

In one example, there is provided a method comprising:

connecting a metadata store with a plurality of cache controllers via a metadata store fabric;

receiving information at the metadata store from at least one of the plurality of cache controllers;

storing the information as shared distributed metadata in the metadata store;

providing shared access of the shared distributed metadata to the plurality of cache controllers; and

assigning a task to a logic block wherein the task executed at the logic block operates on the shared distributed metadata.

In one example of a method, the metadata store is one of a plurality of metadata stores.

In one example of a method, the plurality of cache controllers and the metadata store are interconnected via a metadata store fabric.

In one example of a method, the metadata store fabric comprises a common logic block to manage the task assigned to one of the plurality of cache controllers.

In one example of a method, the metadata store further comprises a logic block to manage the task assigned to one of the plurality of cache controllers.

In one example of a method, the metadata store is a static random-access memory (SRAM) array.

In one example of a method, the task assigned to the metadata store comprises maintaining least recently used (LRU) indications.

In one example of a method, the task assigned to the metadata store comprises re-allocating a clean entry with a higher least recently used (LRU) indication when a new system memory address is to be cached.

In one example of a method, the shared distributed metadata hosted by the metadata store comprises lock bits, valid bits, and dirty bits.

In one example, the method further comprises locking the valid bits and dirty bits of a given cache controller via a lock bit indicating that the valid bits and dirty bits of the given cache controller are not to be changed except by the given cache controller.

In one example, the method further comprises, upon completion of relevant transactions at a given cache controller, updating the appropriate metadata store of appropriate valid bits and dirty bits, which causes a lock bit to be cleared.

What is claimed is:
1. A memory system, comprising: a plurality of cache controllers with circuitry configured to: access memory controllers which access memory; a metadata store in communication with the at least one cache controller with circuitry configured to: receive information from at least one of the plurality of cache controllers, a portion of which is stored as shared distributed metadata; provide shared access of the hosted shared distributed metadata to the plurality of cache controllers; and a metadata store fabric disposed between the plurality of cache controllers and the at least one metadata store to facilitate the shared access.
2. The memory system of claim 1, wherein the information is related to a task assigned to one of the plurality of cache controllers.
3. The memory system of claim 2, wherein the metadata store fabric further comprises a common logic block to manage the task assigned to one of the plurality of cache controllers.
4. The memory system of claim 2, wherein the metadata store further comprises a logic block to manage the task assigned to one of the plurality of cache controllers.
5. The memory system of claim 1, wherein the metadata store is one of a plurality of metadata stores.
6. The memory system of claim 1, wherein the metadata store is one of a plurality of metadata stores and the number of the plurality of metadata stores corresponds to the number of the plurality of cache controllers.
7. The memory system of claim 1, wherein the metadata store is one of a plurality of metadata stores and the number of the plurality of metadata stores is greater than the number of the plurality of cache controllers.
8. The memory system of claim 1, wherein the metadata store is a static random-access memory (SRAM) array.
9. The memory system of claim 1, wherein one of the tasks assigned to the metadata store comprises maintaining least recently used (LRU) indications.
10. The memory system of claim 1, wherein one of the tasks assigned to the metadata store comprises re-allocating an entry based on the least recently used (LRU) indication when a new system memory address is to be cached.
11. The memory system of claim 1, wherein the shared distributed metadata hosted by the metadata store comprises valid bits and dirty bits.
12. The memory system of claim 1, wherein the shared distributed metadata hosted by the metadata store comprises lock bits pertaining to the plurality of cache controllers.
13. The memory system of claim 12, wherein a lock bit is to assert that the valid bits and dirty bits of a given cache controller are locked and are not changed except by the given cache controller.
14. The memory system of claim 1, wherein one of the plurality of cache controllers, upon completion of all transactions relating to a metadata entry, is to update the metadata store of appropriate valid bits and dirty bits and cause a lock bit to be cleared.
15. The memory system of claim 14, wherein a logic block is configured to identify dirty entries for a scrubbing operation wherein the logic block is associated with the metadata store fabric or the metadata store.
16. A method, comprising: connecting a metadata store with a plurality of cache controllers via a metadata store fabric; receiving information at the metadata store from at least one of the plurality of cache controllers; storing the information as shared distributed metadata in the metadata store; providing shared access of the shared distributed metadata to the plurality of cache controllers; and assigning a task to a logic block wherein the task executed at the logic block operates on the shared distributed metadata.
17. The method of claim 16, wherein the metadata store is one of a plurality of metadata stores.
18. The method of claim 16, wherein the plurality of cache controllers and the metadata store are interconnected via a metadata store fabric.
19. The method of claim 16, wherein the plurality of cache controllers and a plurality of metadata stores are interconnected via a metadata store fabric.
20. The method of claim 16, wherein a metadata store fabric comprises a common logic block to manage metadata operations.
21. The method of claim 16, wherein the metadata store further comprises a logic block to manage metadata operations.
22. The method of claim 16, wherein the metadata store stores metadata in a static random-access memory (SRAM) array.
23. The method of claim 16, wherein the task assigned to a logic block comprises maintaining least recently used (LRU) indications.
24. The method of claim 16, wherein the task assigned to a logic block comprises re-allocating a clean entry with a higher least recently used (LRU) indication when a new system memory address is to be cached.
25. The method of claim 16, wherein the shared distributed metadata hosted by the metadata store comprises tag bits, valid bits, and dirty bits.
26. The method of claim 25, further comprising: locking the valid bits and dirty bits of a given cache controller via a lock bit indicating that the valid bits and dirty bits of the given cache controller are not to be changed except by the given cache controller.
27. The method of claim 16, further comprising: upon completion of relevant transactions at a given cache controller, updating an appropriate metadata store of appropriate valid bits and dirty bits and causing a lock bit to be cleared.